feat(RFC): Adds `altair.datasets` #3631

dangotbanned · 2024-10-04T18:57:00Z

Tracking

Waiting on the next vega-datasets release.
Once there is a stable datapackage.json available, quite a lot of tools/datasets can be simplified/removed.

3.0.0 Release vega-datasets#654

Discovered a bug that makes some handling of expressions a little less efficient.

[Bug]: Missing handling for Iterator[IntoExpr] narwhals-dev/narwhals#1897

Upstreaming some nw.Schema stuff to narwhals

[Enh]: nw.(DType|Schema) conversion API narwhals-dev/narwhals#1912

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

- No datasets are included in the package
  - Instead, several metadata files form a dense summary of vega-datasets/datapackage.json
  - 3 files provide a reduction (70-150kb -> 15kb) and optimized views for this particular use-case
  - Includes redundancies for missing dependencies
- Strong support for typing
  - Annotations are generated from the metadata itself
  - https://github.com/vega/altair/blob/9e9deeb95668d2c4e7d30311e85a8f9f6acdc88c/altair/datasets/_typing.py
- So far, 4 backends have been implemented, instead of only pandas
  - These provide precise IDE completions, with a lot of help from https://github.com/narwhals-dev/narwhals
- Users can opt-out of caching remote dataset requests
  - With the "polars" backend, the slowest I've had on a cache-hit is 0.1s to load
    - https://cdn.jsdelivr.net/npm/[email protected]/data/flights-200k.json
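The opt-out caching described above could look roughly like the sketch below. This is not the PR's implementation: `fetch`, its `download` callback, and the convention that `cache_dir=None` means "caching disabled" are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path, PurePosixPath
from typing import Callable


def fetch(
    url: str,
    download: Callable[[str], bytes],
    cache_dir: Path | None = None,
) -> bytes:
    """Return the bytes behind ``url``, reusing a local copy when caching is on.

    Hypothetical helper: ``cache_dir=None`` stands in for "opted out of caching".
    """
    if cache_dir is None:
        # Caching disabled: every call goes through ``download``.
        return download(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the file on a hash of the full URL, so the same file name
    # requested from two different versions never collides.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / f"{key}{PurePosixPath(url).suffix}"
    if target.exists():
        return target.read_bytes()  # cache hit, no request made
    data = download(url)
    target.write_bytes(data)
    return data
```

Passing the downloader in as a callback keeps the sketch free of network access and makes the hit/miss behaviour easy to check with a stub.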

Examples

These all come from the docstrings of:

Loader
Loader.from_backend
Loader.__call__

from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object

load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']

altair/datasets/__init__.py

dangotbanned · 2025-01-31T20:33:15Z

altair/datasets/_cache.py

+    # TODO: Open an issue in ``narwhals`` to try and get a public api for type conversion
+    def schema_pyarrow(self, name: _Dataset, /):
+        schema = self.schema(name)
+        if schema:
+            from narwhals._arrow.utils import narwhals_to_native_dtype
+            from narwhals.utils import Version
+
+            m = {k: narwhals_to_native_dtype(v, Version.V1) for k, v in schema.items()}
+        else:
+            m = {}
+        return nw.dependencies.get_pyarrow().schema(m)


TODO

Open an issue in narwhals to try and get a public api for type conversion ([Enh]: nw.(DType|Schema) conversion API narwhals-dev/narwhals#1912)

feat: add nw.Schema.to_* methods narwhals-dev/narwhals#1924

altair/datasets/_reader.py

dangotbanned added 6 commits October 2, 2024 22:13

wip

7933771

feat(DRAFT): Minimal reimplementation

b30081e

refactor: Make version accessible via data.source_tag

279586b

- Allow quickly switching between version tags #3150 (comment)

refactor: ext_fn -> Dataset.read_fn

32150ad

docs: Add trailing docs to long literals

f1d18a2

docs: Add module-level doc

4d3c550

dangotbanned added the maintenance label Oct 4, 2024

dangotbanned added 23 commits October 4, 2024 20:15

Merge branch 'main' into vega-datasets

7e65841

Merge branch 'main' into vega-datasets

05773af

Merge branch 'main' into vega-datasets

4fff80a

feat: Adds .arrow support

3a284a5

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

feat: Add support for caching metadata

22a5039

feat: Support env var VEGA_GITHUB_TOKEN

a618ffc

Not required for these requests, but may be helpful to avoid limits
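A minimal sketch of how an optional token might be attached to requests, assuming the token is sent as a bearer token to the GitHub REST API. The helper name `github_headers` is hypothetical and is not taken from the PR; only the `VEGA_GITHUB_TOKEN` variable name comes from the commit above.

```python
from __future__ import annotations

import os
from typing import Mapping


def github_headers(env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Build request headers, adding auth only when a token is configured."""
    env = os.environ if env is None else env
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    token = env.get("VEGA_GITHUB_TOKEN")
    if token:
        # Authenticated requests get a higher rate limit;
        # unauthenticated requests still work without this header.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Accepting the environment as a parameter is only there so the behaviour can be checked without touching the real `os.environ`.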

feat: Add support for multi-version metadata

1792340

As an example, for comparing against the most recent version, I've added the 5 most recent

refactor: Renaming, docs, reorganize

fa2c9e7

feat: Support collecting release tags

24cd7d7

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags
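For illustration, a request to the endpoint linked above and the extraction of tag names from its response might be sketched as follows. The function names are hypothetical, and the payload used in the usage note is made up to match the documented response shape (a JSON array of objects with a `name` field); it is not real output.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode

TAGS_URL = "https://api.github.com/repos/vega/vega-datasets/tags"


def tags_request_url(per_page: int = 5, page: int = 1) -> str:
    """URL for one page of the "list repository tags" endpoint."""
    return f"{TAGS_URL}?{urlencode({'per_page': per_page, 'page': page})}"


def tag_names(payload: str) -> list[str]:
    """Pull the tag names out of a JSON response body, in response order."""
    return [item["name"] for item in json.loads(payload)]
```

With a made-up body such as `'[{"name": "v2.0.0"}, {"name": "v1.0.0"}]'`, `tag_names` returns the names in the order given. Asking for a small `per_page` keeps each response short when only recent tags matter.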

feat: Adds refresh_tags

7dd461f

- Basic mechanism for discovering new versions
- Tries to minimise number of and total size of requests

feat(DRAFT): Adds url_from

9768495

Experimenting with querying the url cache w/ expressions

fix: Wrap all requests with auth

c38c235

chore: Remove DATASET_NAMES_USED

a22cc8a

feat: Major GitHub rewrite, handle rate limiting

1181860

- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**
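One common way to handle GitHub rate limiting is to read the `x-ratelimit-remaining` and `x-ratelimit-reset` response headers and wait until the reset time once the quota is used up. The sketch below assumes that approach and assumes header names have already been lower-cased; `seconds_until_reset` is a hypothetical helper, not the logic used in this commit.

```python
from __future__ import annotations

import time
from typing import Mapping


def seconds_until_reset(headers: Mapping[str, str], now: float | None = None) -> float:
    """Seconds to wait before the next request; 0.0 while quota remains."""
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    now = time.time() if now is None else now
    # ``x-ratelimit-reset`` is a UTC epoch timestamp in seconds.
    reset = float(headers.get("x-ratelimit-reset", now))
    return max(0.0, reset - now)
```

A caller would sleep for the returned duration before retrying, rather than retrying immediately and burning further requests.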

feat(DRAFT): Partial implement data("name")

31eeb20

fix(typing): Resolve some mypy errors

511a845

Merge branch 'main' into vega-datasets

c76cfd4

Merge branch 'main' into vega-datasets

d3f0497

Merge branch 'main' into vega-datasets

1b3390b

fix(ruff): Apply 3.8 fixes

a770ba9

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

docs(typing): Add WorkInProgress marker to data(...)

686a485

- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well

Merge branch 'main' into vega-datasets

ba4491d

Merge branch 'main' into vega-datasets

1a4e107

dangotbanned added 11 commits January 30, 2025 13:37

chore: add workaround for narwhals bug

e68ab89

Opened (narwhals-dev/narwhals#1897)
Marking (#3631 (comment)) as resolved

feat(typing): replace (Read|Scan)Impl classes with aliases

576a9b4

- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as parent

feat: Rename, docs unwrap_or -> unwrap_or_skip

91562d5

refactor: Replace ._contents w/ .__str__()

1628cbd

Inspired by https://github.com/pypa/packaging/blob/8510bd9d3bab5571974202ec85f6ef7b0359bfaf/src/packaging/requirements.py#L67-L71

fix: Use correct type for pyarrow.csv.read_csv

cbd04e3

Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```

docs: Add docs for Read, Scan, BaseImpl

c0a92a6

docs: Clean up _merge_kwds, _solve

2b8bf5e

refactor(typing): Include all suffixes in Extension

755ab4f

Also simplifies and removes outdated `Extension`-related tooling

feat: Finish Reader.profile

0ba3d67

- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`

test: Use Reader.profile in is_polars_backed_pyarrow

845b3ee

feat: Clean up, add tests for new exceptions

869d216

dangotbanned commented Jan 31, 2025

altair/datasets/__init__.py Outdated

dangotbanned commented Jan 31, 2025

feat: Adds Reader.open_markdown

7bb6f9e

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
- All the info is available and it is quicker than manually searching the headings in a browser

dangotbanned mentioned this pull request Feb 1, 2025

[Enh]: nw.(DType|Schema) conversion API narwhals-dev/narwhals#1912

Open

dangotbanned added 2 commits February 1, 2025 20:55

docs: fix typo

760eb66

Resolves #3631 (comment)

Merge remote-tracking branch 'upstream/main' into vega-datasets

94220be

dangotbanned commented Feb 2, 2025

altair/datasets/_reader.py Outdated

dangotbanned added a commit that referenced this pull request Feb 3, 2025

fix: _typing.RowColKwds:2: ERROR: Unknown target name: "generic". [docutils]

2433b82

See https://altair-viz.github.io/user_guide/generated/theme/altair.theme.RowColKwds.html
Discovered this fix in #3631 (comment)

dangotbanned added 3 commits February 3, 2025 21:55

fix: fix typo in error message

cc6d757

#3631 (comment)

Merge remote-tracking branch 'upstream/main' into vega-datasets

1b64392

Merge remote-tracking branch 'upstream/main' into vega-datasets

2bd89aa

dangotbanned added a commit that referenced this pull request Feb 5, 2025

ci: bump narwhals>=1.25.1

1ea8faf

Contains a fix (narwhals-dev/narwhals#1934) for #3631 (comment)

dangotbanned mentioned this pull request Feb 5, 2025

ci: bump narwhals>=1.25.1 #3792

Merged

dangotbanned added 6 commits February 5, 2025 17:57

Merge remote-tracking branch 'upstream/main' into vega-datasets

193fabd

refactor: utilize narwhals fix

6c93eb0

narwhals-dev/narwhals#1934

refactor: utilize nw.Implementation.from_backend

790ff10

See narwhals-dev/narwhals#1888

feat(typing): utilize nw.LazyFrame working TypeVar

8e53848

Possible since narwhals-dev/narwhals#1930
@MarcoGorelli if you're interested in what that PR did (besides fix warnings 😉)

Merge remote-tracking branch 'upstream/main' into vega-datasets

e7f7ba8

docs: Show less data in examples

2c3b44d
